1.   DATA  COMPRESSION  FOR  CHARACTER  STRINGS

1.1.   The  Necessity  of  Compression 

In the field of data communications, one recurring problem is how
best to transmit the most information in the least time and at the least
cost. Within the scope of a centralized computer system, the problem is
often solved by "hardwiring" the devices which wish to communicate. As
the number and speed of the peripherals increase (by design changes or
equipment upgrading), previously adequate data paths begin to show signs of
saturation and new techniques must be employed. Three popular techniques
are: (1) increasing the bandwidth of the data path to achieve a greater
parallelism in the architecture; (2) switching the mode of operation of data
path controllers from bit-serial to bit-parallel, with an appropriate increase
in hardware; and (3) multiplexing a single high-speed line among many
slower-speed devices. While these techniques work adequately, they are
somewhat dependent upon a controlled environment in which device speeds are
well-matched, data paths are relatively short, error rates are small, and
the equipment involved is well-understood. But as one leaves the central
system and begins to explore the world of remote computing and remote
conversational access in general, a new series of problems arises.
Primary among these is the basic inadequacy of the common or "voice-grade"
telephone line for carrying information at high speed with acceptably low
error rates.

The rate at which a state-of-the-art minicomputer can produce
information now far outstrips the capabilities of the voice-grade line. Yet
the alternative, a "conditioned line," which is capable of high-speed
operation, is not only expensive to install and costly to use, but denies
the basic advantage of telephone circuitry as a data path: the ability to
communicate from one point to another point, however unlikely, provided
each end has access to a common household telephone.

The problem gets completely out of hand when one considers more
ambitious projects such as remote graphics. A machine with the size, speed,
and capacity of an IBM 360/75, talking to, say, a Calcomp plotter, generates
literally millions of instructions to control the plotting of even simple
graphs. Drawing a 10" straight line at a 45° angle requires some 2000
commands, all of which must be transmitted over some data link. The expected
result is that remote graphics is unbearably slow, and when in use tends to
completely saturate the data path, thus effectively blocking use of the line
for any other device.

1.2.  A  Software  Solution 

It is this impasse which prompted research on the subject in question:
if line speed is constrained to some constant value, how can the density
of information transmitted best be increased? The answer appears to involve
manipulation of the transmitted data itself, and in some cases the addition
of pre-processors and post-processors to encode/decode the transmitted
information. While this solution also requires additional hardware, the
"intelligence" required is small and well within the range of minicomputers
and even microcomputers (processor-on-a-chip technology). Such hardware is
not expensive: one popular 8-bit microprocessor sells for $60 in quantity.
In fact, the utilization of microcomputers can be shown to represent good
economy when compared with the "conditioned line" alternative. Thus, the
hardware capabilities for data compression/decompression exist, and what
remains is to determine what software methodology should be applied.

2.   DATA  COMPRESSION

2.1.  The  Transmitted  Message 

The purpose of this project was to determine, for a specific
class of "messages" which exist initially as print files on disk, how such
a file should be pre-processed locally, transmitted across telephone
circuitry, and post-processed at a remote site so as to minimize the number
of characters actually transmitted (and, hence, telephone charges). The
boundary conditions of the problem were:

1.  All  information  to  be  transmitted  was  basically  of 
the  same  type:   relay  ladders  for  process  control 
logic  (i.e.,  a  graph).   The  following  pages  show 
the  sequence  of  equations,  object  code,  and  graph 
which  are  to  be  transmitted.   Only  the  graph  is  of 
sufficient  length  to  justify  condensation. 

2.  The  graphs  were  stored  as  line  images,  1  to  80 
characters/line,  and  varied  in  length  from  30  to 
1500  lines. 

3.  There  was  similarity,  and  indeed  duplication,  of  the
basic  "building  blocks"  within  each  line  and  further 
repetition  of  these  building  blocks  among  lines  of 
the  graph. 

It seemed reasonable to expect that detecting common phrases
of the graph, and reducing each from the physical character string itself
to either a character and a repetition factor or a pointer into a
"dictionary" of common phrases, would significantly decrease the total
length of transmitted text.

***************************** EQUATIONS *****************************

CR1   =  (PB1+CR1)*PB2
MX1   =  CR1
CR3   =  MX1 & PB4-NC*TR5-XX0
CR4   =  MX1 & (PB4-NC+TR5-00X)*CR5
CR5   =  MX1 & (CR3*/CR4+CR5)*(CR3+/CR4)*/CR7
CR6   =  MX1 & /CR5+CR6*CR7
TR1   =  MX1 & CR6
CR7   =  MX1 & (LS2+CR7*CR8)*CR6
CR8   =  MX1 & CR6*TR1-00X*LS3*TR4-XX0
CR10  =  MX1 & LS6*/CR13*/CR12*/CR5
CR11  =  MX1 & (PS1+CR11)*/CR12*CR10*/CR5
SOLA  =  MX1 & /PS1*/CR11*LS1
CR12  =  MX1 & (PS1+CR11)*FAULT+CR12*PB5
TR3   =  MX1 & WC
WC    =  MX1 & CR11*LS2
CR13  =  MX1 & (CR13+TR3-00X)*/CR5
TR4   =  MX1 & CR13
CR2   =  MX1 & (CR2*/CR7+TR4-00X*CR5)*SS1
TR5   =  MX1 & CR2

2.2.   A  Solution  Utilizing  Duplicate  Character  Compression
2.2.1.   The  Encoding  Algorithm

If two or more contiguous characters in the message are identical,
replace the string with a two-character phrase where the first
character is both a phrase marker and a count of the number of duplicate
characters and the second character of the pair is the normal bit-code of
the repeated character.

Thus the string:        ABBBCBBBBDEEF       length=13   (*)

reduces to:             A(3)BC(4)BD(2)EF    length=10

while the string:       XYYYZXYYYZ          length=10   (**)

reduces to:             X(3)YZX(3)YZ        length=8

Note that in string (*) the replacement of 'EE' with '(2)E'
represents neither a saving nor a loss with regard to the actual number
of characters transmitted; the larger common phrase 'XYYYZ' is not detected
in string (**) because of the contiguous duplicate character requirement.

Nevertheless, the method shows promise and features simplicity
of the encoding/decoding mechanism as illustrated in the following Knuth-style
description of the algorithm.

Step 1 (initialization)
     F ← first character of character string MSG
     S ← second character of MSG
     i ← 1
     j ← 2
     L ← total length of MSG

Step 2 (longest contiguous string of F's)
     while (F = S & j ≤ L) do
          j ← j + 1
          S ← j-th character of MSG

Step 3 (output)
     if (j - i) = 1 then output (F)
     else do
          output (j - i)
          output (F)

Step 4 (halt?)
     if j > L then halt

Step 5 (scan remaining characters)
     i ← j
     j ← j + 1
     F ← S
     S ← j-th character of MSG
     go to step 2
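
The five steps can be transliterated almost line for line. The following
Python sketch is illustrative only and is not part of the original project:
the names encode, char_at, and show are hypothetical, the marker/count is
modeled as an integer token, and positions past the end of MSG are assumed
to read as a value that matches no character.

```python
def encode(msg):
    """Duplicate-character compression following Steps 1-5.

    Returns a list of tokens: an int is a combined marker/count,
    a str is an ordinary character.
    """
    L = len(msg)
    if L == 0:
        return []

    def char_at(k):
        # 1-based access; positions past the end never match a real character
        return msg[k - 1] if k <= L else None

    out = []
    i, j = 1, 2                        # Step 1 (initialization)
    F, S = char_at(1), char_at(2)
    while True:
        while F == S and j <= L:       # Step 2 (longest run of F's)
            j += 1
            S = char_at(j)
        if j - i == 1:                 # Step 3 (output)
            out.append(F)
        else:
            out.append(j - i)
            out.append(F)
        if j > L:                      # Step 4 (halt?)
            return out
        i, j = j, j + 1                # Step 5 (scan remaining characters)
        F, S = S, char_at(j)


def show(tokens):
    """Render tokens in the report's notation, e.g. A(3)BC(4)B."""
    return "".join(f"({t})" if isinstance(t, int) else t for t in tokens)


print(show(encode("ABBBCBBBBDEEF")))   # A(3)BC(4)BD(2)EF
print(show(encode("XYYYZXYYYZ")))      # X(3)YZX(3)YZ
```

The token count, len(encode(msg)), corresponds to the "length" figures quoted
for strings (*) and (**), since each marker/count is counted as one character.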


2.2.2.   The  Decoding  Algorithm

The decoding algorithm is equally simple.

Step 1 (initialization)
     F ← first character of encoded string MSG
     S ← second character of MSG
     i ← 2
     L ← total length of MSG

Step 2 (marker?)
     if F is a marker then do
          output F copies of S
          i ← i + 1
          if i > L then halt
          F ← i-th character of MSG
          i ← i + 1
          if i > L then do
               output (F)
               halt
          S ← i-th character of MSG
          go to step 2
     else do
          output (F)
          i ← i + 1
          if i > L then do
               output (S)
               halt
          F ← S
          S ← i-th character of MSG
          go to step 2
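
As a rough illustration of decoding, the Python sketch below expands a
message in which each repetition is written as a '%' marker, a three-digit
count, and the repeated character, which is the layout used by the PL/I
programs in this report. It uses a single scanning index rather than the
F/S look-ahead of the step description, and it assumes that '%' never occurs
as ordinary text; the names MARKER and decode are hypothetical.

```python
MARKER = "%"   # assumed repetition marker; must not occur as ordinary text


def decode(code):
    """Expand '%nnnC' groups into nnn copies of the character C."""
    out = []
    i = 0
    while i < len(code):
        if code[i] == MARKER:
            count = int(code[i + 1:i + 4])    # three-digit repetition factor
            out.append(code[i + 4] * count)   # the repeated character
            i += 5
        else:
            out.append(code[i])               # ordinary character, copied as is
            i += 1
    return "".join(out)


print(decode("ABC%005XABC"))   # ABCXXXXXABC
```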
Several examples showing the operation of the two algorithms,
and calculating the number and percentage of characters saved, are shown
next. The first program is a PL/I version of the encoding algorithm,
followed by five sample runs using data picked at random from sample graphs,
followed by the program for the decoding algorithm, followed by a sample
run showing the text expansion back into the original string.

It is interesting, and perhaps surprising, to note that this
rather simple-minded procedure produced transmission savings of 67.6%,
68.6%, 56.7%, 70.3%, and 71.7%, respectively, for the five examples chosen
from this general class of messages. The contiguous duplicate character
compression algorithm shows promise based on its reasonable efficiency
and simplicity.

2.3.   A  Solution  Utilizing  Common  Phrase  Detection  and  Replacement
2.3.1.   Common  Phrase  Detection

As  noted  in  a  previous  example,  the  requirement  of  contiguous 
duplicate  characters  prevents  the  recognition  of  duplicated  blocks  of 
non-identical  characters.  For  the  class  of  graphs  studied,  it  is  obvious 
that  the  strings  '--(   )--',  '--(/)--',  '         ',  and  others  are  used 
repeatedly  to  build  the  individual  lines  of  the  graph,  and  not  much  compaction 
is  permitted  by  the  previous  algorithm  when  applied  to  these  strings.  This 
immediately  introduces  a  new  question:   for  some  string  S,  does  there  exist 
a  set  P  of  common  phrases  of  S  whose  repeated  use  in  S  could  be  replaced 
with  pointers  to  a  dictionary  of  phrases  P?  Clearly,  the  answer  is  yes  and 
such  an  algorithm  could  be  implemented  ad  hoc,  given  such  a  set  of  phrases  P. 
But  a  more  interesting  question  is:   for  some  string  S,  does  there  exist  a 
set  P  of  common  phrases  of  S  whose  replacement  in  S  by  pointers  to  P  yields 
a  minimal  length  new  string  S'? 

/*  TEXT COMPRESSION BY DUPLICATE CHARACTER COMPACTION  */
/*  ENCODING ALGORITHM  */
/*  COMPUTER SCIENCE 389 PROJECT  */
/*  ALFRED C. WEAVER  */

/*  AN ALGORITHM TO REDUCE THE LENGTH OF A TEXT FILE BY COMPACTING  */
/*  SUCCESSIVE DUPLICATE CHARACTERS INTO A SIGNAL BYTE, FOLLOWED  */
/*  BY A COUNT BYTE, FOLLOWED BY THE CHARACTER ITSELF  */

COMP:  PROC OPTIONS(MAIN);

/*  'MSG' IS THE INPUT TEXT STRING TO BE REDUCED  */
DCL MSG CHAR(82) VAR, (FIRST,SECOND) CHAR(1), CH CHAR(3);

/*  WHEN INPUT IS EXHAUSTED, PRINT THE STATISTICS FOR  */
/*  THIS PARTICULAR MESSAGE  */
ON ENDFILE(SYSIN) BEGIN;
   PUT SKIP(2) EDIT ('STATISTICS:', 'ORIGINAL MESSAGE LENGTH:',
      LO, 'REDUCED MESSAGE LENGTH:', LR, 'SAVING:', LO-LR,
      ' (=', (LO-LR)*100/LO, ' %)')
      (A, 3 (SKIP, A, F(10)), A, F(5,1), A);
   PUT PAGE;
   STOP;
END;

/*  'LO' IS THE ORIGINAL LENGTH OF THE MESSAGE  */
/*  'LR' IS THE LENGTH OF THE REDUCED MESSAGE  */
LO,LR=0;

DO WHILE('1'B);
   /*  READ, PRINT, AND COUNT THE INPUT MESSAGE  */
   GET LIST (MSG);
   PUT SKIP(2) EDIT ('ORIGINAL MESSAGE:', '''', MSG, '''')
      (A, COL(30), 3 A);
   LO=LO+LENGTH(MSG);
   PUT SKIP EDIT ('REDUCED MESSAGE:', '''') (A, COL(30), A);

   /*  'FIRST' IS THE FIRST CHARACTER OF THE INPUT STRING  */
   /*  'SECOND' IS THE SECOND CHARACTER OF THE INPUT STRING  */
   MSG=MSG||'  ';
   I=1;  FIRST=SUBSTR(MSG,1,1);
   J=2;  SECOND=SUBSTR(MSG,2,1);

   /*  REPEAT UNTIL THE ENTIRE PHRASE HAS BEEN EXAMINED  */
   DO WHILE (SUBSTR(MSG,I) ¬= '');
      /*  FIND THE LONGEST STRING OF CHARACTER 'FIRST'  */
      DO WHILE (FIRST=SECOND & J<=80);
         J=J+1;
         SECOND=SUBSTR(MSG,J,1);
      END;

      /*  IF J-I=1, THERE WAS NO DUPLICATE CHARACTER, SO OUTPUT IT ALONE  */
      IF J-I=1 THEN DO;
         PUT EDIT (FIRST) (A(1));
         LR=LR+1;
      END;

      /*  OTHERWISE PUT OUT A REPETITION MARKER (='%'), FOLLOWED BY  */
      /*  THE COUNT FIELD (REPETITION FACTOR), FOLLOWED BY  */
      /*  THE REPEATED CHARACTER ITSELF  */
      ELSE DO;
         PUT STRING(CH) EDIT (J-I) (F(3));
         CH=TRANSLATE(CH,'0',' ');
         PUT EDIT ('%', CH, FIRST) (3 A);
         LR=LR+2;
      END;
      I=J;
      J=J+1;
      FIRST=SECOND;
      SECOND=SUBSTR(MSG,J,1);
   END;
   PUT EDIT ('''') (A);
END;

END COMP;

[Sample run 1 of the encoding program: each ORIGINAL MESSAGE (one line of
the graph) is listed with its REDUCED MESSAGE.]

STATISTICS:
ORIGINAL MESSAGE LENGTH:        992
REDUCED MESSAGE LENGTH:         320
SAVING:        672  (=  67.6 %)

[Sample run 2 of the encoding program: each ORIGINAL MESSAGE (one line of
the graph) is listed with its REDUCED MESSAGE.]

STATISTICS:
ORIGINAL MESSAGE LENGTH:       1040
REDUCED MESSAGE LENGTH:         328
SAVING:        720  (=  68.6 %)

[Sample run 3 of the encoding program: each ORIGINAL MESSAGE (one line of
the graph) is listed with its REDUCED MESSAGE.]

STATISTICS:
ORIGINAL MESSAGE LENGTH:       1157
REDUCED MESSAGE LENGTH:         500
SAVING:        657  (=  56.7 %)

[Sample run 4 of the encoding program: each ORIGINAL MESSAGE (one line of
the graph) is listed with its REDUCED MESSAGE.]

STATISTICS:
ORIGINAL MESSAGE LENGTH:        949
REDUCED MESSAGE LENGTH:         280
SAVING:        669  (=  70.3 %)

[Sample run 5 of the encoding program: each ORIGINAL MESSAGE (one line of
the graph) is listed with its REDUCED MESSAGE.]

STATISTICS:
ORIGINAL MESSAGE LENGTH:       1207
REDUCED MESSAGE LENGTH:         340
SAVING:        867  (=  71.7 %)

/*  TEXT EXPANSION FROM AN ENCODED MESSAGE  */
/*  ENCODING MECHANISM:  CONTIGUOUS DUPLICATE CHARACTER COMPRESSION  */
/*  COMPUTER SCIENCE 389 PROJECT  */
/*  ALFRED C. WEAVER  */

DECODE:  PROC OPTIONS(MAIN);

DCL CODE CHAR(80) VAR INIT('X'), (F,S) CHAR(1);

/*  REPEAT FOR EACH ENCODED MESSAGE  */
DO WHILE(CODE ¬= ' ');
   GET LIST (CODE);
   PUT SKIP(2) EDIT ('ENCODED MESSAGE IS:', CODE,
      'RECONSTRUCTED MESSAGE IS:', ' ')
      (A, COL(30), A, SKIP, A, COL(29), A);
   I=1;

   /*  SCAN ACROSS THE ENTIRE MESSAGE  */
   DO WHILE(I <= LENGTH(CODE));
      F=SUBSTR(CODE,I,1);

      /*  DETERMINE IF A CHARACTER IS REPEATED  */
      IF F='%' THEN DO;
         GET STRING(SUBSTR(CODE,I+1,3)) EDIT (K) (F(3));
         S=SUBSTR(CODE,I+4,1);

         /*  REPEAT THE CHARACTER 'K' TIMES  */
         DO J=1 TO K;
            PUT EDIT (S) (A);
         END;
         I=I+5;
      END;

      /*  OUTPUT A SINGLE CHARACTER  */
      ELSE DO;
         PUT EDIT (F) (A);
         I=I+1;
      END;
   END;
END;
END DECODE;

ENCODED MESSAGE IS:          ABC%005XABC
RECONSTRUCTED MESSAGE IS:    ABCXXXXXABC

ENCODED MESSAGE IS:          %002A%005B%010C%020DX
RECONSTRUCTED MESSAGE IS:    AABBBBBCCCCCCCCCCDDDDDDDDDDDDDDDDDDDDX

ENCODED MESSAGE IS:          %010A%010B
RECONSTRUCTED MESSAGE IS:    AAAAAAAAAABBBBBBBBBB

ENCODED MESSAGE IS:          ABCDEFGHIJKLMNOPQRSTUVWXYZ
RECONSTRUCTED MESSAGE IS:    ABCDEFGHIJKLMNOPQRSTUVWXYZ

[Remainder of the sample decoding run: encoded lines of the sample graph,
each followed by its reconstructed line.]


Again  it  is  clear  that  a  set  of  phrases  of  S  can  be  generated 
by  enumeration.  Now,  given  a  set  P  of  common  phrases  in  S,  what  do  we 
know  about  them? 

(1)  If S is of length N, then no phrase p ∈ P can be of
length > ⌊N/2⌋, else it could not be repeated and, hence,
could not be "common";

(2)  A string of length N will have:

N phrases of length 1,
(N-1) phrases of length 2,
(N-2) phrases of length 3,
...
and (N - ⌊N/2⌋ + 1) phrases of length ⌊N/2⌋.
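The two counts above can be checked with a few lines of Python. This is a minimal sketch and not part of the original report; the function names `phrase_counts` and `phrases_of_length` are hypothetical and chosen only for illustration.

```python
def phrase_counts(n):
    """Number of phrases (not necessarily distinct) of each useful length.

    A string of length n has n - k + 1 phrases of length k, and only
    lengths up to n // 2 can possibly repeat without overlapping.
    """
    return {k: n - k + 1 for k in range(1, n // 2 + 1)}


def phrases_of_length(s, k):
    """Enumerate every phrase of length k in s, left to right."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


if __name__ == "__main__":
    print(phrase_counts(6))
    print(phrases_of_length("ABCABC", 3))
```

For N = 6 the longest candidate length is 3, and there are 6 - 3 + 1 = 4 phrases of that length.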

Would the replacement of all phrases p with pointers to P yield
a minimal length S'? Not unless there is an ordering to P, for replacement
of p_i='AB' and p_j='ABC', applied in the order (i,j) to S='ABCABC', yields
S'=(p_i)C(p_i)C, with length=4, while the order (j,i) yields S'=(p_j)(p_j) with
length=2.


Suppose P were to be ordered by length of the p_i such that replacements
always removed the longest phrases first. This is not sufficient since
p_i='ABCD' and p_j='CDEAB', applied in the order (j,i) since |p_j|>|p_i|, to
S='ABCDEABCD', yields S'='AB(p_j)CD' with length=5, while the order (i,j)
yields S'='(p_i)E(p_i)' with length=3.

So length of common phrases alone does not establish an order on P.
Does frequency of usage affect the order of P? In addition to all phrases
p_i ∈ P, also keep a count f_i of how many times that phrase appears in the
message. Now order P by decreasing f_i, i.e., largest number of uses first.
This method also fails since for

p_i = 'ABC'      f_i = 3
p_j = 'XABCY'    f_j = 1      applied to S='ABCXABCYABC'

in the order (i,j) because f_i > f_j yields S'=(p_i)X(p_i)Y(p_i) with length=5,
while the ordering (j,i) yields S'=(p_i)(p_j)(p_i) with length=3.
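The three counterexamples above can be checked mechanically. The following Python sketch is not from the original report: `replace_in_order` is a hypothetical helper that applies a list of phrases in the given order and, as in the text, counts every pointer as a single symbol. Pointers are represented here as integers so that they can never be mistaken for message characters.

```python
def replace_in_order(s, phrases):
    """Replace each phrase, in list order, by a one-symbol pointer.

    Returns a token list: plain characters stay as one-character strings,
    pointers are the integer index of the phrase in `phrases`.
    """
    tokens = list(s)
    for number, phrase in enumerate(phrases):
        pattern = list(phrase)
        out, i = [], 0
        while i < len(tokens):
            # A pointer (int) never equals a character, so phrases are
            # only matched against text that is still unreplaced.
            if tokens[i:i + len(pattern)] == pattern:
                out.append(number)
                i += len(pattern)
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens


if __name__ == "__main__":
    # Shorter phrase first versus longer phrase first.
    print(len(replace_in_order("ABCABC", ["AB", "ABC"])))
    print(len(replace_in_order("ABCABC", ["ABC", "AB"])))
    # Longest-first ordering is not sufficient.
    print(len(replace_in_order("ABCDEABCD", ["CDEAB", "ABCD"])))
    print(len(replace_in_order("ABCDEABCD", ["ABCD", "CDEAB"])))
    # Most-frequent-first ordering is not sufficient either.
    print(len(replace_in_order("ABCXABCYABC", ["ABC", "XABCY"])))
    print(len(replace_in_order("ABCXABCYABC", ["XABCY", "ABC"])))
```

The six printed lengths are 4 and 2, 5 and 3, and 5 and 3, matching the figures quoted in the text.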

But  now  let  us  combine  these  two  methods  to  account  for  the  effects 
of  both  phrase  length  and  frequency  in  suggesting  a  possible  solution  to 
the  problem  of  how  to  pick  and  order  a  set  P. 

2.3.2.  A  Conjecture 

Given a string of characters S of length N, let P be the set of
phrases of S such that each element of P is a phrase p_j, where p_j is the j-th
phrase of length |p_j|, and p_j has frequency of occurrence f_j ≥ 2 in S.

Then for each phrase p_j define a reduction factor r_j = f_j (|p_j| - 1)
which represents the number of characters saved when |p_j| characters of
text are replaced with one one-character phrase identifier (pointer) for each
of the f_j occurrences of p_j.

Construct the set of phrases P', which are the p_j sorted into
descending order by r_j. Within groups of phrases with equal r value, sort
again by descending length of phrases. Now replace, in order, every occurrence
of p'_j in S with the appropriate reference pointer until all p'_j have been
examined, thus yielding a new string of characters and/or phrase pointers, S'.
Then S' is the minimal length text string which can be created from S.
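The procedure in the conjecture can be sketched in Python as follows. This is an illustration under stated assumptions, not the report's program: the names `candidate_phrases`, `order_phrases` and `compress` are hypothetical, frequencies are counted with `str.count` (non-overlapping occurrences), only phrases of length 2 to ⌊N/2⌋ are considered, and each pointer is counted as one symbol. The sketch applies the ordering rule only; it does not establish that the result is minimal.

```python
def candidate_phrases(s):
    """Map each phrase occurring at least twice to its frequency f."""
    n = len(s)
    found = {}
    for length in range(n // 2, 1, -1):
        for i in range(n - length + 1):
            p = s[i:i + length]
            if p not in found:
                f = s.count(p)  # non-overlapping occurrences
                if f >= 2:
                    found[p] = f
    return found


def order_phrases(found):
    """Sort by descending r = f * (|p| - 1), then by descending length."""
    return sorted(found, key=lambda p: (-found[p] * (len(p) - 1), -len(p)))


def compress(s):
    """Return (phrase table, token list); pointers are 1-based ints."""
    table = order_phrases(candidate_phrases(s))
    tokens = list(s)
    for number, phrase in enumerate(table, start=1):
        pattern = list(phrase)
        out, i = [], 0
        while i < len(tokens):
            if tokens[i:i + len(pattern)] == pattern:
                out.append(number)
                i += len(pattern)
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return table, tokens


if __name__ == "__main__":
    table, tokens = compress("ABCXABCYABCZXABCY")
    print(table[:2], tokens)
```

For the message 'ABCXABCYABCZXABCY' the first two table entries are 'XABCY' and 'ABC' (both with r = 8, longer phrase first), and the 17 characters reduce to 5 symbols.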
2.3.3.  An Example

Consider the (arbitrary) message 'ABCXABCYABCZXABCY'. The
set of unique phrases appearing at least twice includes:

phrase #    phrase      f     r

    1       'XABCY'     2     8
    2       'XABC'      2     6
    3       'ABCY'      2     6
    4       'ABC'       4     8
    5       'XAB'       2     4
    6       'BCY'       2     4
    7       'AB'        4     4
    8       'BC'        4     4
    9       'XA'        2     2
   10       'CY'        2     2

The ordering of replacement of the phrases with r=8 is not
arbitrary; the longest such phrase ('XABCY') must be applied first. Replacing
'XABCY' first we derive:

ABCXABCYABCZXABCY      length=17
ABC(1)ABCZ(1)          length=9
(4)(1)(4)Z(1)          length=5

while replacing 'ABC' first derives:

ABCXABCYABCZXABCY      length=17
(4)X(4)Y(4)ZX(4)Y      length=9

The  algorithm  described  has  been  programmed  and  commented  to  accept 
a  message  string,  determine  all  possible  phrases  which  occur  repeatedly, 
sort  them  by  descending  r  values,  and  replace  common  phrases  with  pointers 
into  phrase  table  P. 
/*  OPTIMAL PHRASE EXTRACTION FROM TEXT STRINGS  */
/*  COMPUTER SCIENCE 389 PROJECT  */
/*  ALFRED C. WEAVER  */

/*  AN ALGORITHM TO EXTRACT SUBPHRASES FROM A MESSAGE 'S'  */
/*  SUCH THAT REPLACEMENT OF SUBPHRASES WITH "PHRASE POINTERS"  */
/*  YIELDS A MINIMUM LENGTH MESSAGE  */

PHRASES: PROC OPTIONS(MAIN);

   /*  'S' IS THE ORIGINAL MESSAGE (CHARACTER STRING)  */
   /*  'P' IS THE ARRAY OF SUBPHRASES OF S  */
   /*  'F' IS THE ARRAY OF FREQUENCY COUNTS  */
   /*  'R' IS THE ARRAY OF REDUCTION FACTORS  */
   DCL S CHAR(100) VAR, (P(500),PHRASE) CHAR(50) VAR, CH CHAR(3),
       (F(500),R(500),A,I,J,K,L,M,N) FIXED BIN(31);

   /*  REPEAT UNTIL INPUT STREAM IS EXHAUSTED  */
   ON ENDFILE(SYSIN) STOP;
   FOREVER: DO WHILE('1'B);
      GET LIST(S);
      N = LENGTH(S);
      K = 0;
      F = 0;
      R = 0;
      PUT SKIP(2) EDIT('ORIGINAL STRING:','''',S,'''','SUBPHRASE',
         'FREQUENCY','REDUCTION FACTOR')(A,COL(30),3 A,SKIP,
         A,COL(20),A,COL(30),A);

      /*  SEARCHING FOR THE LONGEST SUBPHRASE FIRST IS  */
      /*  ESSENTIAL TO PREVENT A RESORT LATER ON PHRASE LENGTH  */
      DO L = FLOOR(N/2) TO 2 BY -1;
         /*  EXTRACT EACH OF THE SUBPHRASES OF LENGTH 'L' IN 'S'  */
         DO I = 1 TO N-L+1;
            PHRASE=SUBSTR(S,I,L);
            /*  THIS LOOP DISCARDS COMMON SUBPHRASES  */
            /*  THIS STEP IS NOT ESSENTIAL, BUT SAVES TIME  */
            /*  IN THE REPLACEMENT STEP  */
            DO A = 1 TO K;
               IF PHRASE=P(A) THEN GOTO X;
            END;
            /*  'M' POINTS TO THE NEXT POSSIBLE OCCURRENCE OF 'PHRASE'  */
            M=I+L;
            IF M>N THEN GOTO X;
            J=INDEX(SUBSTR(S,M),PHRASE);
            /*  DETERMINE WHETHER 'PHRASE' OCCURS AT LEAST TWICE  */
            IF J>0 /* FREQUENCY > 1 */ THEN DO;
               K=K+1;
               P(K)=PHRASE;
               F(K)=2;
               M=M+J+L-1;
               /*  FIND ALL OCCURRENCES OF 'PHRASE' IN 'S'  */
               DO WHILE (J>0);
                  IF M+L > N THEN GOTO X;
                  J=INDEX(SUBSTR(S,M),PHRASE);
                  IF J>0 THEN DO;
                     M=M+J+L-1;
                     F(K)=F(K)+1;
                  END;
               END;
            END;
   X:    END;
      END;

      /*  ESTABLISH THE REDUCTION FACTOR FOR EACH SUBPHRASE  */
      DO A = 1 TO K;
         R(A) = F(A) * (LENGTH(P(A)) - 1);
         PUT SKIP EDIT ('''',P(A),'''',F(A),R(A))
            (3 A,COL(20),F(4),COL(30),F(8));
      END;

      /*  NOW SORT THE PHRASES INTO DESCENDING ORDER BY R(I)  */
      /*  A SIMPLE JUMP-DOWN SORT WILL DO  */
      DO I=1 TO K-1;
         DO J=I+1 TO K;
            IF R(I)<R(J) THEN /* INTERCHANGE */ DO;
               PHRASE=P(I); L=F(I); M=R(I);
               P(I)=P(J); F(I)=F(J); R(I)=R(J);
               P(J)=PHRASE; F(J)=L; R(J)=M;
            END;
         END;
      END;

      /*  NOW LABEL AND PRINT ALL SUBPHRASES IN THEIR OPTIMAL ORDER  */
      PUT SKIP(2) EDIT ('SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT',
         'PHRASE#   SUBPHRASE', 'FREQUENCY', 'REDUCTION FACTOR')
         (A, SKIP, A, COL(20), A, COL(30), A);
      DO A = 1 TO K;
         PUT SKIP EDIT (A, '''', P(A), '''', F(A), R(A))
            (F(5),COL(11),3 A,COL(20),F(4),COL(30),F(8));
      END;

      /*  REPLACE ALL SUBPHRASES WITH PHRASE REFERENCES  */
      DO I=1 TO K;
         DO WHILE(INDEX(S,P(I))>0);
            J = INDEX(S,P(I));
            PUT STRING(CH) EDIT (I) (F(3));
            CH=TRANSLATE(CH,'0',' ');
            (NOSTRINGRANGE): S = SUBSTR(S,1,J-1) || '$' || CH ||
               SUBSTR(S,J+LENGTH(P(I)));
         END;
      END;

      PUT SKIP(2) EDIT ('THE MINIMAL STRING IS:', S)
         (A, COL(30), A);

   END FOREVER;
END PHRASES;
ORIGINAL STRING:             'ABCXABCYABCZXABCY'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XABCY'               2         8
'XABC'                2         6
'ABCY'                2         6
'ABC'                 4         8
'XAB'                 2         4
'BCY'                 2         4
'AB'                  4         4
'BC'                  4         4
'XA'                  2         2
'CY'                  2         2

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE  FREQUENCY REDUCTION FACTOR
    1     'XABCY'       2         8
    2     'ABC'         4         8
    3     'ABCY'        2         6
    4     'XABC'        2         6
    5     'XAB'         2         4
    6     'BCY'         2         4
    7     'AB'          4         4
    8     'BC'          4         4
    9     'XA'          2         2
   10     'CY'          2         2

THE MINIMAL STRING IS:       $002$001$002Z$001

ORIGINAL STRING:             'XXXXXXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXXXX'               2         8
'XXXX'                2         6
'XXX'                 3         6
'XX'                  4         4

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE  FREQUENCY REDUCTION FACTOR
    1     'XXXXX'       2         8
    2     'XXXX'        2         6
    3     'XXX'         3         6
    4     'XX'          4         4

THE MINIMAL STRING IS:       $001$001

ORIGINAL STRING:             'ABCDABCDABCDEF'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'ABCD'                3         9
'BCDA'                2         6
'CDAB'                2         6
'DABC'                2         6
'ABC'                 3         6
'BCD'                 3         6
'CDA'                 2         4
'DAB'                 2         4
'AB'                  3         3
'BC'                  3         3
'CD'                  3         3
'DA'                  2         2

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE  FREQUENCY REDUCTION FACTOR
    1     'ABCD'        3         9
    2     'BCDA'        2         6
    3     'CDAB'        2         6
    4     'DABC'        2         6
    5     'ABC'         3         6
    6     'BCD'         3         6
    7     'CDA'         2         4
    8     'DAB'         2         4
    9     'AB'          3         3
   10     'BC'          3         3
   11     'CD'          3         3
   12     'DA'          2         2

THE MINIMAL STRING IS:       $001$001$001EF


ORIGINAL STRING:             'ABCDEFABCDEFRSTRSTABCRST'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'ABCDEF'              2        10
'ABCDE'               2         8
'BCDEF'               2         8
'ABCD'                2         6
'BCDE'                2         6
'CDEF'                2         6
'ABC'                 3         6
'BCD'                 2         4
'CDE'                 2         4
'DEF'                 2         4
'RST'                 3         6
'AB'                  3         3
'BC'                  3         3
'CD'                  2         2
'DE'                  2         2
'EF'                  2         2
'RS'                  3         3
'ST'                  3         3

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE  FREQUENCY REDUCTION FACTOR
    1     'ABCDEF'      2        10
    2     'ABCDE'       2         8
    3     'BCDEF'       2         8
    4     'ABCD'        2         6
    5     'BCDE'        2         6
    6     'CDEF'        2         6
    7     'ABC'         3         6
    8     'RST'         3         6
    9     'CDE'         2         4
   10     'DEF'         2         4
   11     'BCD'         2         4
   12     'AB'          3         3
   13     'BC'          3         3
   14     'RS'          3         3
   15     'ST'          3         3
   16     'EF'          2         2
   17     'CD'          2         2
   18     'DE'          2         2

THE MINIMAL STRING IS:       $001$001$008$008$007$008
ORIGINAL STRING:             'XXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XX'                  2         2

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE  FREQUENCY REDUCTION FACTOR
    1     'XX'          2         2

THE MINIMAL STRING IS:       $001$001X

ORIGINAL STRING:             'XXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXX'                 2         4
'XX'                  2         2

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE  FREQUENCY REDUCTION FACTOR
    1     'XXX'         2         4
    2     'XX'          2         2

THE MINIMAL STRING IS:       $001$001

ORIGINAL STRING:             'XXXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXX'                 2         4
'XX'                  3         3

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE  FREQUENCY REDUCTION FACTOR
    1     'XXX'         2         4
    2     'XX'          3         3

THE MINIMAL STRING IS:       $001$001X

ORIGINAL STRING:             'XXXXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXXX'                2         6
'XXX'                 2         4
'XX'                  3         3

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE  FREQUENCY REDUCTION FACTOR
    1     'XXXX'        2         6
    2     'XXX'         2         4
    3     'XX'          3         3

THE MINIMAL STRING IS:       $001$001

ORIGINAL STRING:             'XXXXXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXXX'                2         6
'XXX'                 2         4
'XX'                  4         4

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE  FREQUENCY REDUCTION FACTOR
    1     'XXXX'        2         6
    2     'XXX'         2         4
    3     'XX'          4         4

THE MINIMAL STRING IS:       $001$001X

ORIGINAL STRING:             'XXXXXXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXXXX'               2         8
'XXXX'                2         6
'XXX'                 3         6
'XX'                  4         4

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE  FREQUENCY REDUCTION FACTOR
    1     'XXXXX'       2         8
    2     'XXXX'        2         6
    3     'XXX'         3         6
    4     'XX'          4         4

THE MINIMAL STRING IS:       $001$001

ORIGINAL STRING:             'XXXXXXXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXXXX'               2         8
'XXXX'                2         6
'XXX'                 3         6
'XX'                  5         5

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE  FREQUENCY REDUCTION FACTOR
    1     'XXXXX'       2         8
    2     'XXXX'        2         6
    3     'XXX'         3         6
    4     'XX'          5         5

THE MINIMAL STRING IS:       $001$001X

ORIGINAL STRING:             'XXXXXXXXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXXXXX'              2        10
'XXXXX'               2         8
'XXXX'                2         6
'XXX'                 3         6
'XX'                  5         5

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE  FREQUENCY REDUCTION FACTOR
    1     'XXXXXX'     2        10
    2     'XXXXX'       2         8
    3     'XXXX'        2         6
    4     'XXX'         3         6
    5     'XX'          5         5

THE MINIMAL STRING IS:       $001$001
[Program output for one further test message, composed of '-', '(' and ')'
characters and blanks: the original string, the subphrase texts and the
resulting minimal string are not legible in the source copy. The listing
shows 78 repeated subphrases, with reduction factors ranging from 26 down to 2.]


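ORIGINAL STRING:             '---------------'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'-------'             2        12
'------'              2        10
'-----'               2         8
'----'                3         9
'---'                 4         8
'--'                  7         7

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE  FREQUENCY REDUCTION FACTOR
    1     '-------'     2        12
    2     '------'      2        10
    3     '----'        3         9
    4     '-----'       2         8
    5     '---'         4         8
    6     '--'          7         7

THE MINIMAL STRING IS:       $001$001-

[The output for a final message of the form '+--( / ) ... ( / )-+' breaks off
at this point in the source copy.]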
2.3.4.  Common Phrase Replacement

Assuming now that we have selected by hand, or generated
automatically using the method described, a particular set of phrases
P to be used for replacement in S, we can now attack the question of how
best to recognize and replace a common phrase in a general string. Clearly,
the optimal way is not the linear search which was used for convenience in
the previous program. Let us attack the problem somewhat mathematically.

Assume that a set of common phrases is given. Only those phrases
may be referenced (replaced) within messages. The problem then is to
discover for each message that "parse" into nonoverlapping phrases which
minimizes the new message's length. Let a general character string (message)
to be transmitted be described by the following context-free grammar:

<message>           ::=  <phrase reference> <message>  |
                         <character string> <message>  |
                         <end marker>
<phrase reference>  ::=  P <number>
<end marker>        ::=  E
<character string>  ::=  C <number> <string>
<string>            ::=  (any string of 1 to 256 printable characters)
<number>            ::=  (any integer in the range 0 to 255)


The <number> occurring in a <phrase reference> indicates which of
256 phrases (maximum) is intended for substitution (<number> is the index
into phrase table P); the <number> in <character string> is a count of the
characters in the <string>, minus 1.

Assuming that no string exceeds 256 characters in length, the space
requirements for each component of the message are:

phrase reference    -    2 units
end marker          -    1 unit
character string    -    2+L, where L is the number of characters
                         in the string.
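The message format and its space requirements can be illustrated with a short Python sketch. It is not taken from the report: the functions `encode` and `decode`, the tuple representation of a parse, and the choice of one byte per "unit" are all assumptions made for the example.

```python
def encode(items):
    """Serialize a parse into the P / C / E format described above.

    `items` is a list of ("P", phrase_number) or ("C", text) tuples.
    Each phrase reference costs 2 bytes, each character string 2 + L
    bytes, and the end marker 1 byte.
    """
    out = bytearray()
    for kind, value in items:
        if kind == "P":
            out += b"P" + bytes([value])        # value must be 0..255
        elif kind == "C":
            data = value.encode("ascii")
            if not 1 <= len(data) <= 256:
                raise ValueError("string length must be 1..256")
            out += b"C" + bytes([len(data) - 1]) + data
        else:
            raise ValueError("unknown item kind: %r" % (kind,))
    out += b"E"
    return bytes(out)


def decode(blob, table):
    """Rebuild the original text from an encoded message and phrase table."""
    text, i = [], 0
    while blob[i:i + 1] != b"E":
        if blob[i:i + 1] == b"P":
            text.append(table[blob[i + 1]])
            i += 2
        else:  # b"C": the count byte lets the whole string be copied at once
            n = blob[i + 1] + 1
            text.append(blob[i + 2:i + 2 + n].decode("ascii"))
            i += 2 + n
    return "".join(text)


if __name__ == "__main__":
    table = ["XABCY", "ABC"]
    items = [("P", 1), ("P", 0), ("P", 1), ("C", "Z"), ("P", 0)]
    blob = encode(items)
    print(len(blob), decode(blob, table))
```

With four phrase references (2 units each), one single-character string (2 + 1 units) and the end marker, the 17-character message occupies 12 units. Because the decoder only inspects the marker at the start of each item, a literal 'P', 'C' or 'E' inside a character string is harmless.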
This encoding scheme is not space optimal. In fact, the storage
requirement for a character string of length L could be reduced to L, but at
the cost of having the decoding mechanism examine every character in the string
looking for a phrase reference (P) or end marker (E). Since this is
supposed to be an algorithm for application purposes, that appeared to be
unacceptably slow; hence the addition of two extra characters [C <number>],
allowing the direct application of the IBM 360 "Move Characters" instruction
to move the string all at once without examining individual characters.

An efficient algorithm for producing space-optimal parses has been
developed by using this strategy.

Consider one message as a simple character string. Number the
character positions from 1 through N. Suppose that one can compute the
function

f(j) = least space necessary to store characters
       j, j+1, ..., N of the given message, for
       1 ≤ j ≤ N.

Then f(1) will be the space-optimal parse length for the entire message.

Let P be the set of all phrases. For each p ∈ P let |p| = length
of p. Let ST(j,p) be a predicate which is true when phrase p matches
character positions j, j+1, ..., j + |p| - 1 of the given message string.
ST(j,p) is false when p is not a phrase or when p does not match the string
beginning at position j in the message.

To define f(I), let

P(I) = {p | ST(I,p)}

F(I) = min {F(I+|p|) + 2 for p ∈ P(I),  F(I+1) + 1}    for 1 ≤ I ≤ N.

Assume by induction that f(j)=F(j), for I < j ≤ N. Assume that phrase
p ∈ P(I) is used in the parse at I; it will match characters I, I+1, ...,
I + |p| - 1, and that storage space can be reduced to two characters
(the phrase marker and the phrase reference number). Then the remainder
of the message, characters I + |p|, I + |p| + 1, ..., N, will require
f(I + |p|) characters for storage. But f(I + |p|) = F(I + |p|) by the
induction hypothesis. Now assume that no phrase could be used at I.
Then the one-character string at I can be stored followed by the optimal
parse of characters I + 1, I + 2, ..., N. Since a one-character string
requires one character of storage, f(I + 1) + 1 = F(I + 1) + 1. Now
simply minimize all alternatives at each I and set f(I) = F(I).
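The recurrence can be evaluated from the end of the message backwards. The Python sketch below is an illustration and not the report's implementation: the name `optimal_parse` is hypothetical, the boundary value F(N+1) = 0 is an assumption (the text does not state it), and the cost model is exactly the one used in the recurrence, namely 2 units per phrase reference and 1 unit per literal character, with no end marker and no [C <number>] overhead.

```python
def optimal_parse(message, phrases):
    """Return (F(1), parse) for the recurrence in the text.

    Positions are 1-based as in the text.  `parse` is a list whose
    elements are either a literal character or ("P", phrase_number).
    """
    n = len(message)
    F = [0] * (n + 2)            # F[n + 1] = 0: nothing left to store (assumed)
    choice = [None] * (n + 2)    # phrase number used at each position, if any
    for i in range(n, 0, -1):
        best, pick = F[i + 1] + 1, None              # one literal character
        for number, p in enumerate(phrases):
            if p and message.startswith(p, i - 1):   # predicate ST(i, p)
                cost = F[i + len(p)] + 2             # marker + phrase number
                if cost < best:
                    best, pick = cost, number
        F[i], choice[i] = best, pick

    # Walk forward through the recorded choices to recover the parse.
    parse, i = [], 1
    while i <= n:
        if choice[i] is None:
            parse.append(message[i - 1])
            i += 1
        else:
            parse.append(("P", choice[i]))
            i += len(phrases[choice[i]])
    return F[1], parse


if __name__ == "__main__":
    print(optimal_parse("ABCDEABCD", ["ABCD", "CDEAB"]))
```

For the message 'ABCDEABCD' with phrases 'ABCD' and 'CDEAB', the sketch chooses 'ABCD' at both ends and stores 'E' literally, for a total of 2 + 1 + 2 = 5 units. The scan over all phrases at each position is the linear search that the hashing step described next is meant to avoid.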

Finally, the search for phrases in P is accelerated by using
a "hash table" technique. Since this algorithm will incur a cost of two
characters overhead for each <phrase reference>, only phrases with |p| ≥ 3
are considered for replacement. The characters I, I + 1, and I + 2 of the
message are then hashed to accelerate the search for phrases beginning with
that three-character segment.
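One way to realize the three-character hashing step is sketched below in Python. The report does not describe its hash function or table layout, so everything here is an assumption: a Python dictionary stands in for the hash table, and `build_index` and `phrases_at` are hypothetical names.

```python
from collections import defaultdict


def build_index(phrases):
    """Group phrase numbers by their first three characters.

    Phrases shorter than three characters are left out, since replacing
    them cannot pay for the two-character reference overhead.
    """
    index = defaultdict(list)
    for number, p in enumerate(phrases):
        if len(p) >= 3:
            index[p[:3]].append(number)
    return index


def phrases_at(message, i, phrases, index):
    """Numbers of all phrases that match `message` at 0-based offset i."""
    bucket = index.get(message[i:i + 3], [])
    return [k for k in bucket if message.startswith(phrases[k], i)]


if __name__ == "__main__":
    phrases = ["ABCD", "ABC", "XY", "CDEAB"]
    index = build_index(phrases)
    print(phrases_at("ABCDEABCD", 0, phrases, index))
    print(phrases_at("ABCDEABCD", 2, phrases, index))
```

Only the phrases that share the three-character prefix at the current position are compared in full, so the cost per position depends on the bucket size rather than on the size of the whole phrase table.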
3.   SUMMARY

This study has illustrated the necessity of data compression
in a telecommunications environment. Two techniques have been presented
which accomplish data compression by very different methods: duplicate
character compression and common phrase replacement. For the type of
data under consideration, both work quite well.

The success of the duplicate character compression method is due
in large part to the specific type of data being transmitted, which did,
in fact, have many occurrences of contiguous duplicate characters. The
common phrase detection and replacement method is more general and will apply
to a large number of situations, but it incurs a much larger overhead. Thus,
the method selected is, predictably, a strong function of the transmitted
message.